Abstract

Classifier evasion consists in finding for a given instance $x$ the “nearest”
instance $x'$ such that the classifier predictions of $x$ and $x'$ are
different. We present two novel algorithms for systematically computing
evasions for tree ensembles such as boosted trees and random forests. Our first
algorithm uses a Mixed Integer Linear Program solver and finds the optimal
evading instance under an expressive set of constraints. Our second algorithm
trades off optimality for speed by using symbolic prediction, a novel algorithm
for fast finite differences on tree ensembles. On a digit recognition task, we
demonstrate that both gradient boosted trees and random forests are extremely
susceptible to evasions. Finally, we harden a boosted tree model without loss
of predictive accuracy by augmenting the training set of each boosting round
with evading instances, a technique we call adversarial boosting.

1. Introduction 


Deep neural networks (DNN) represent a prominent success of machine learning.
These models can successfully and accurately address difficult learning
problems, including classification of audio, video, and natural language,
where previous approaches have failed. Yet, the existence of evading instances
for the current incarnation of DNNs (Szegedy et al., 2013) shows a perhaps
surprising brittleness: for virtually any instance $x$ that the model
classifies correctly, it is possible to find a negligible perturbation
$\delta$ such that $x + \delta$ evades being correctly classified, that is,
receives a (sometimes widely) inaccurate prediction.

The general study of the evasion problem matters on both conceptual and
practical grounds. First, we expect a high-performance learning algorithm to
generalize well and be hard to evade: only a “large enough” perturbation
$\delta$ should be able to alter its decision. The existence of small-$\delta$
evading instances shows a defect in the generalization ability of the model,
and hints at improper model class and/or insufficient regularization. Second,
machine learning is becoming the workhorse of security-oriented applications,
the most prominent example being unwanted content filtering. In those
applications, the attacker has a large incentive for finding evading
instances. For example, spammers look for small, cost-effective changes to
their online content to avoid detection and removal.

While prior work extensively studies the evasion problem on differentiable
models by means of gradient descent, those results are reported in an
essentially qualitative fashion, implicitly defaulting the choice of metric
for measuring $\delta$ to the $L_2$ norm. Further, non-differentiable,
non-continuous models have received very little attention. Tree sum-ensembles
as produced by boosting or bagging are perhaps the most important models from
this class as they are often able to achieve competitive performance and enjoy
good adoption rates in both industrial and academic contexts.


An abridged version of this paper appears in Proceedings of the 33rd
International Conference on Machine Learning, New York, NY, USA, 2016.
JMLR: W&CP volume 48.


In this paper, we develop two novel exact and approximate evasion algorithms
for sum-ensembles of trees. Our exact (or optimal) evasion algorithm computes
the smallest $\delta$ according to the $L_p$ norm for $p = 0, 1, 2, \infty$
such that the model misclassifies $x + \delta$. The algorithm relies on a
Mixed Integer Linear Program solver and enables precise quantitative
robustness statements. We benchmark the robustness of boosted trees and random
forests on a concrete handwritten digit classification task by comparing the
minimal required perturbation $\delta$ across many representative models.
Those models include $L_1$ and $L_2$ regularized logistic regression,
max-ensemble of linear classifiers (shallow maxout network), a 3-layer deep
neural network and a classic RBF-SVM. The comparison shows that for this task,
despite their competitive accuracies, tree ensembles are consistently the most
brittle models across the board.

Finally, our approximate evasion algorithm is based on symbolic prediction, a
fast and novel method for computing finite differences for tree ensemble
models. We use this method for generating more than 11 million synthetic
confusing instances and incorporate those during gradient boosting in an
approach we call adversarial boosting. This technique produces a hardened
model which is significantly harder to evade without loss of accuracy.


2. Related Work 

From the onset of the adversarial machine learning subfield, evasion is
recognized as part of the larger family of attacks occurring at inference
time: exploratory attacks (Barreno et al., 2006). While there is a prolific
literature considering the evasion of linear or otherwise differentiable
models (Dalvi et al., 2004; Lowd & Meek, 2005a;b; Nelson et al., 2012;
Brückner et al., 2012; Fawzi et al., 2014; Biggio et al., 2013; Szegedy et
al., 2013; Šrndić & Laskov, 2014), we are only aware of a single paper
tackling the case of tree ensembles. In Xu et al. (Xu et al., 2016), the
authors present a genetic algorithm for finding malicious PDF instances which
evade detection.

In this paper, we forgo application-specific feature extraction and work
directly in feature space. We briefly discuss strategies for modeling the
feature extraction step in the Additional Constraints paragraph of Section
4.3. We deliberately do not limit the amount of information available for
carrying out evasion. In this paper, our goal is to establish the intrinsic
evasion robustness of the machine learning models themselves, and thus provide
a guaranteed worst-case lower bound. In contrast to (Xu et al., 2016), our
exact algorithm guarantees optimality of the solution, and our approximate
algorithm performs a fast coordinate descent without the additional tuning and
hyper-parameters that a genetic algorithm requires.

We contrast our paper with a few related papers on deep neural networks, as
these are the closest in spirit to the ideas developed here. Goodfellow et al.
(Goodfellow et al., 2014) hypothesize that evasion in practical deep neural
networks is possible because these models are locally linear. However, this
paper demonstrates that despite their extreme non-linearity, boosted trees are
even more susceptible to evasion than neural networks. On the hardening side,
Goodfellow et al. (Goodfellow et al., 2014) introduce a regularization penalty
term which simulates the presence of evading instances at training time, and
show limited improvements in both test accuracy and robustness. Gu et al.
(Gu & Rigazio, 2015) show preliminary results by augmenting deep neural
networks with a pre-filtering layer based on a form of contractive
auto-encoding. Most recently, Papernot et al. (Papernot et al., 2015) show the
strong positive effect of distillation on evasion robustness for neural
networks. In this paper, we demonstrate a large increase in robustness for a
boosted tree model hardened by adversarial boosting. We empirically show that
our method does not degrade accuracy and creates the most robust model in our
benchmark problem.


3. The Optimal Evasion Problem 


In this section, we formally introduce the optimal evasion problem and briefly
discuss its relevance to adversarial machine learning. We follow the
definition of (Biggio et al., 2013). Let $c : \mathcal{X} \to \mathcal{Y}$ be
a classifier. For a given instance $x \in \mathcal{X}$ and a given “distance”
function $d : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$, the optimal
evasion problem is defined as:

$$\min_{x' \in \mathcal{X}} d(x, x') \quad \text{subject to} \quad c(x) \neq c(x') \qquad (1)$$

In this paper, we focus on binary classifiers defined over an $n$-dimensional
feature space, that is $\mathcal{Y} = \{-1, 1\}$ and
$\mathcal{X} \subseteq \mathbb{R}^n$.

Setting the classifier $c$ aside, the distance function $d$ fully specifies
(1), hence we talk about $d$-evading instances, or $d$-robustness. In fact,
many problems of interest in adversarial machine learning fit under
formulation (1) with a judicious choice for $d$. In the adversarial learning
perspective, $d$ can be used to model the cost the attacker has to pay for
changing her initial instance $x$. In this paper, we proceed as if this cost
is decomposable over the feature dimensions. In particular, we present results
for four representative distances. We briefly describe those and their typical
effects on the solution of (1).

The $L_0$ distance $\sum_{i=1}^{n} \mathbb{I}_{x_i \neq x'_i}$, or Hamming
distance, encourages the sparsest, most localized deformations with arbitrary
magnitude. Our optimal evasion algorithm can also handle the case of
non-uniform costs over features. This situation corresponds to minimizing
$\sum_{i=1}^{n} a_i \mathbb{I}_{x_i \neq x'_i}$ where $a_i$ are non-negative
weights.

The $L_1$ distance $\sum_{i=1}^{n} |x_i - x'_i|$ encourages localized
deformations and additionally controls for their magnitude.

The $L_2$ distance $\sqrt{\sum_{i=1}^{n} (x_i - x'_i)^2}$ encourages less
localized but small deformations.

The $L_\infty$ distance $\max_i |x_i - x'_i|$ encourages uniformly spread
deformations with the smallest possible magnitude.

Note that for binary-valued features, $L_1$ and $L_2$ reduce to $L_0$, and
$L_\infty$ results in the trivial solution value 1 for (1).
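
As an illustration of the four distances, here is a minimal Python sketch. The
function names (`l0`, `l1`, `l2`, `linf`) and the `costs` argument are
illustrative choices, not part of the paper.

```python
import math

def l0(x, y, costs=None):
    """Hamming distance; `costs` holds optional non-negative weights a_i."""
    if costs is None:
        costs = [1] * len(x)
    return sum(a for a, xi, yi in zip(costs, x, y) if xi != yi)

def l1(x, y):
    """Sum of absolute per-feature differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def l2(x, y):
    """Euclidean distance."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def linf(x, y):
    """Largest absolute per-feature difference."""
    return max(abs(xi - yi) for xi, yi in zip(x, y))

x = [0, 3, 1]
x_prime = [0, 1, 2]
print(l0(x, x_prime), l1(x, x_prime), l2(x, x_prime), linf(x, x_prime))
```

On binary vectors, `l1` and the square of `l2` both coincide with `l0`, which
is the reduction mentioned in the note above.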

4. Evading Tree Ensemble Models

We start by introducing tree ensemble models along with some useful notation.
We then describe our optimal and approximate algorithms for generating evading
instances on sum-ensembles of trees.

4.1. Tree Ensembles 

A sum-ensemble of trees model $f : \mathbb{R}^n \to \mathbb{R}$ consists of a
set $\mathcal{T}$ of regression trees. Without loss of generality, a
regression tree $T \in \mathcal{T}$ is a binary tree where each internal node
$n \in T.\text{nodes}$ holds a logical predicate $n.\text{predicate}$ over the
feature variables, outgoing node edges are by convention labeled
$n.\text{true}$ and $n.\text{false}$, and finally each leaf
$l \in T.\text{leaves}$ holds a numerical value
$l.\text{prediction} \in \mathbb{R}$. For a given instance
$x \in \mathbb{R}^n$, the prediction path in $T$ is the path from the tree
root to a leaf such that for each internal node $n$ in the path,
$n.\text{true}$ is also in the path if and only if $n.\text{predicate}$ is
true. The prediction of tree $T$ is the leaf value of the prediction path.
Finally, the signed margin prediction $f(x)$ of the ensemble model is the sum
of all individual tree predictions and the predicted label is obtained by
thresholding, with the threshold value commonly fixed at zero: $c(x) = 1$ if
and only if $f(x) > 0$.
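
The definitions above can be sketched in a few lines of Python. The `Node`
class, its field names, and the small example tree are hypothetical and only
serve to illustrate the prediction path, the signed margin $f(x)$ and the
thresholded label $c(x)$.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class Node:
    # Internal node: the predicate is "x[feature] < threshold".
    feature: Optional[int] = None
    threshold: Optional[float] = None
    true: Optional["Node"] = None
    false: Optional["Node"] = None
    # Leaf node: `prediction` holds the leaf value.
    prediction: Optional[float] = None

    def is_leaf(self) -> bool:
        return self.prediction is not None

def tree_predict(root: Node, x: Sequence[float]) -> float:
    """Follow the prediction path from the root down to a leaf."""
    node = root
    while not node.is_leaf():
        node = node.true if x[node.feature] < node.threshold else node.false
    return node.prediction

def margin(trees: Sequence[Node], x: Sequence[float]) -> float:
    """Signed margin f(x): sum of the individual tree predictions."""
    return sum(tree_predict(t, x) for t in trees)

def classify(trees: Sequence[Node], x: Sequence[float]) -> int:
    """Predicted label c(x), thresholding the margin at zero."""
    return 1 if margin(trees, x) > 0 else -1

# Hypothetical two-level tree over n = 2 features.
example_tree = Node(
    feature=0, threshold=2.0,
    true=Node(feature=0, threshold=1.0,
              true=Node(prediction=-2.0), false=Node(prediction=1.0)),
    false=Node(feature=1, threshold=1.0,
               true=Node(prediction=1.0), false=Node(prediction=2.0)),
)
print(margin([example_tree], [0.0, 3.0]), classify([example_tree], [0.0, 3.0]))
```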

In this paper, we consider the case of single-feature threshold predicates of
the form $x_i < \tau$ or equivalently $x_i > \tau$, where $0 \le i < n$ and
$\tau \in \mathbb{R}$ are fixed model parameters. This restriction excludes
oblique decision trees where predicates simultaneously involve several feature
variables. We however note that oblique trees are seldom used in ensemble
classifiers, partially because of their relatively high construction cost and
complexity (Norouzi et al., 2015). Before describing our generic approach for
solving the optimal evasion problem, we first state a simple worst-case
complexity result for problem (1).


4.2. Theoretical Hardness of Evasion

For a given tree ensemble model $f$, finding an $x \in \mathbb{R}^n$ such that
$f(x) > 0$ (or $f(x) < 0$ without loss of generality) is NP-complete. That is,
irrespective of the choice for $d$, the optimal evasion problem (1) requires
solving an NP-complete feasibility subproblem.

We now give a proof of this fact by reduction from 3-SAT. First, given an
instance $x$, computing the sign of $f(x)$ can be done in time at most
proportional to the model size. Thus the feasibility problem is in NP. It is
further NP-complete by a linear time reduction from 3-SAT as follows. We
encode in $x$ the assignment of values to the variables of the 3-SAT instance
$S$. By convention, we choose $x_i > 0.5$ if and only if variable $i$ is set
to true in $S$. Next, we construct $f$ by arranging each clause of $S$ as a
binary regression tree. Each regression tree has exactly one internal node per
level, one for each variable appearing in the clause. Each internal node holds
a predicate of the form $x_i > 0.5$ where $i$ is a clause variable. The nodes
are arranged such that there exists a unique prediction path corresponding to
the falseness of the clause. For this path, the prediction value of the leaf
is set to the opposite of the number of clauses in $S$, which is also the
number of trees in the reduction. The remaining leaves' predictions are set to
1. Figure 1 illustrates this construction on an example.

Figure 1. Regression tree for the clause $x_0 \lor \neg x_1 \lor x_2$. In this
example, $S$ has 13 clauses.

It is easy to see that $S$ is satisfiable if and only if there exists $x$ such
that $f(x) > 0$. Indeed, a satisfying assignment for $S$ corresponds to $x$
such that $f(x) = |\mathcal{T}| > 0$ and any non-satisfying assignment for $S$
corresponds to $x$ such that $f(x) \le -1 < 0$ because there is at least one
false clause, which corresponds to a regression tree whose output is
$-|\mathcal{T}|$.

While we cannot expect an efficient algorithm for solving all instances of
problem (1) unless P = NP, it may be the case that tree ensemble models as
produced by common learners such as gradient boosting or random forests are
practically easy to evade. We now turn to an algorithm for optimally solving
the evasion problem when $d$ is one of the distances presented in Section 3.

4.3. Optimal Evasion


Let $f$ be a sum-ensemble of trees as defined in Section 4.1 and
$x \in \mathbb{R}^n$ an initial instance. We present a reduction of problem
(1) into a Mixed Integer Linear Program (MILP). This reduction avoids
introducing constraints with so called “big-M” constants (Griva et al., 2008)
at the cost of a slightly more complex solution encoding. We experimentally
find that our reduction produces tight formulations and acceptable running
times for all common models $f$.

In what follows, we present the mixed integer program by defining three groups
of MILP variables: the predicate variables encode the state (true or false) of
all predicates, the leaf variables encode which prediction leaf is active in
each tree, and the optional objective variable for the case where $d$ is the
$L_\infty$ norm.


We then introduce three families of constraints: the predicates consistency
constraints enforce the logical consistency between predicates, the leaves
consistency constraints enforce the logical consistency between prediction
leaves and predicates, and the model mislabel constraint enforces the
condition $c(x) \neq c(x')$, or equivalently that $f(x') > 0$ or $f(x') < 0$
depending on the sign of $f(x)$. Finally we reduce the objective of (1) by
relating the predicate variables to the value of $d(x, x')$ in the objective.

Program Variables. For clarity, MILP variables are bolded and italicized
throughout. Our reduction uses three families of variables.

• At most $\sum_{T \in \mathcal{T}} |T.\text{nodes}|$ binary variables
  $\boldsymbol{p}_i \in \{0, 1\}$ (predicates) encoding the state of the
  predicates. Our implementation creates those variables sparingly: if any two
  or more predicates in the model are logically equivalent, their state is
  represented by a single variable. For example, the state of $x'_i < 0$ and
  $-x'_i > 0$ would be represented by the same variable.

• $\sum_{T \in \mathcal{T}} |T.\text{leaves}|$ continuous variables
  $0 \le \boldsymbol{l}_i \le 1$ (leaves) encoding which prediction leaf is
  active in each tree. The MILP constraints force exactly one
  $\boldsymbol{l}_i$ per tree to be non-zero with $\boldsymbol{l}_i = 1$. The
  $\boldsymbol{l}$ variables are thus implied binary in any solution but are
  nonetheless typed continuous to narrow down the choice of branching variable
  candidates during branch-and-bound, and hence improve solving time.

• At most 1 non-negative continuous variable $\boldsymbol{b}$ (bound) for
  expressing the distance $d(x, x')$ of problem (1) when $d$ is the $L_\infty$
  distance. This variable is first used in the objective paragraph.

In what follows, we illustrate our reduction by using a model with a single
regression tree as represented in Figure 2.

Figure 2. Regression tree for the reduction example. Predicate variables
$\boldsymbol{p}$ and leaf variables $\boldsymbol{l}$ are shown next to their
corresponding internal and leaf nodes. There are $n = 2$ continuous features.
The leaf predictions are -2, 1, 1 and 2.

Predicates consistency. Without loss of generality, each predicate variable
$\boldsymbol{p}_i$ corresponds to the state of a predicate of the form
$x'_k < \tau$. If two variables $\boldsymbol{p}_i$ and $\boldsymbol{p}_j$
correspond to predicates over the same variable, $x'_k < \tau_1$ and
$x'_k < \tau_2$, then $\boldsymbol{p}_i$ and $\boldsymbol{p}_j$ can take
inconsistent values without additional constraints. For instance, if
$\tau_1 < \tau_2$, then $\boldsymbol{p}_i = 1$ and $\boldsymbol{p}_j = 0$
would be logically inconsistent because
$x'_k < \tau_1 \Rightarrow x'_k < \tau_2$, but any other valuation is
possible.

For each feature variable $x'_k$ we can ensure the consistency of all
$\boldsymbol{p}$ variables which reference a predicate over $x'_k$ by adding
$K - 1$ inequalities enforcing the implicit implication constraints between
the predicates, where $K$ is the number of $\boldsymbol{p}$ variables
referencing $x'_k$. For a given $k$, let $\tau_1 < \dots < \tau_K$ be the
sorted thresholds of the predicates over $x'_k$. Let
$\boldsymbol{p}_1, \dots, \boldsymbol{p}_K$ be the MILP variables
corresponding to predicates $x'_k < \tau_1, \dots, x'_k < \tau_K$. A valuation
of $(\boldsymbol{p}_i)_{i=1..K}$ is consistent if and only if
$\boldsymbol{p}_1 = 1 \Rightarrow \dots \Rightarrow \boldsymbol{p}_K = 1$.
Thus the consistency constraints are:

$$\boldsymbol{p}_1 \le \boldsymbol{p}_2 \le \dots \le \boldsymbol{p}_K$$


When the feature variables $x'_k$ are binary-valued, there is a single
$\boldsymbol{p}_i$ variable associated to a feature variable: all predicates
$x'_k < \tau$ with $0 < \tau < 1$ are equivalent. Generally, tree building
packages generate a threshold of 0.5 in this situation. This is however
implementation dependent and we can simplify the formulation with additional
knowledge of the value domain $x'_k$ is allowed to take.

In our toy example in Figure 2, variables $\boldsymbol{p}_0$ and
$\boldsymbol{p}_1$ refer to the same feature dimension 0 and are not
independent. The predicate consistency constraint in this case is:

$$\boldsymbol{p}_1 \le \boldsymbol{p}_0$$

and no other predicate consistency constraint is needed.
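
The chain constraint can be checked on a small example. The sketch below uses
hypothetical thresholds and helper names; it evaluates the predicates
$x'_k < \tau_i$ for sorted thresholds and enumerates which binary valuations
satisfy $p_1 \le p_2 \le \dots \le p_K$.

```python
from itertools import product

def predicate_values(value, thresholds):
    """State of the predicates `value < tau`, thresholds sorted increasingly."""
    return [1 if value < tau else 0 for tau in sorted(thresholds)]

def is_consistent(p):
    """Check the chain p_1 <= p_2 <= ... <= p_K."""
    return all(a <= b for a, b in zip(p, p[1:]))

thresholds = [1.0, 2.0, 5.0]  # hypothetical thresholds over one feature

# Any concrete feature value yields a consistent valuation.
for value in (0.0, 1.5, 3.0, 7.0):
    assert is_consistent(predicate_values(value, thresholds))

# Conversely, only K + 1 of the 2^K binary valuations are consistent.
consistent = [p for p in product((0, 1), repeat=len(thresholds))
              if is_consistent(p)]
print(consistent)
```

With $K = 3$ thresholds, the enumeration keeps four valuations, one per
interval delimited by the thresholds.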

Leaves consistency. These constraints bind the $\boldsymbol{p}$ and
$\boldsymbol{l}$ variables so that the semantics of the regression trees are
preserved. Each regression tree has its own independent set of leaves
consistency constraints. We construct the constraints such that the following
properties hold:

(i) if $\boldsymbol{l}_k = 1$, then every other $\boldsymbol{l}_{i \neq k}$
    variable within the same tree is zero, and

(ii) if a leaf variable $\boldsymbol{l}_k$ is 1, then all predicate variables
    $\boldsymbol{p}_i$ encountered in the prediction path of the corresponding
    leaf are forced to be either 0 or 1 in accordance with the semantics of
    the prediction path, and

(iii) exactly one $\boldsymbol{l}_k$ variable per tree is equal to 1. This
    property is needed because (i) does not force any $\boldsymbol{l}_i$ to be
    non-zero.

Enforcing property (i) is done using a classic exclusion constraint. If
$\boldsymbol{l}_1, \dots, \boldsymbol{l}_K$ are the $K$ leaf variables for a
given tree, then the following equality constraint enforces (i):

$$\boldsymbol{l}_1 + \boldsymbol{l}_2 + \dots + \boldsymbol{l}_K = 1 \qquad (2)$$
For our toy example, this constraint is:

$$\boldsymbol{l}_1 + \boldsymbol{l}_2 + \boldsymbol{l}_3 + \boldsymbol{l}_4 = 1$$

Enforcing property (ii) requires two constraints per internal node. Let us
start at the root node $r$. Let $\boldsymbol{p}_{root}$ be the variable
corresponding to the root predicate. Let
$\boldsymbol{l}^T_1, \dots, \boldsymbol{l}^T_i$ be the variables corresponding
to the leaves of the subtree rooted at $r.\text{true}$, and
$\boldsymbol{l}^F_1, \dots, \boldsymbol{l}^F_j$ the variables for the subtree
rooted at $r.\text{false}$. The root predicate is true if and only if the
active prediction leaf belongs to the subtree rooted at $r.\text{true}$. In
terms of the MILP reduction, this means that $\boldsymbol{p}_{root}$ is equal
to 1 if and only if one of the leaf variables of the true subtree is set to
one. Similarly on the false subtree, $\boldsymbol{p}_{root}$ is 0 if and only
if one of the leaf variables of the false subtree is set to one. Because only
one leaf can be non-zero, these constraints can be written as:

$$\boldsymbol{l}^T_1 + \dots + \boldsymbol{l}^T_i = \boldsymbol{p}_{root} = 1 - (\boldsymbol{l}^F_1 + \dots + \boldsymbol{l}^F_j)$$

The case of internal nodes is identical, except that the “if and only if”s are
weakened to single side implications. Indeed, unlike the root case, it is
possible that no leaf in either subtree is an active prediction leaf. For an
internal node $n$, let $\boldsymbol{p}_{node}$ be the variable attached to the
node, and $\boldsymbol{l}^T_1, \dots, \boldsymbol{l}^T_i$ and
$\boldsymbol{l}^F_1, \dots, \boldsymbol{l}^F_j$ the variables attached to
leaves of the true and false subtrees rooted at $n.\text{true}$ and
$n.\text{false}$. The constraints are:

$$1 - (\boldsymbol{l}^F_1 + \dots + \boldsymbol{l}^F_j) \ge \boldsymbol{p}_{node} \ge \boldsymbol{l}^T_1 + \dots + \boldsymbol{l}^T_i$$

In our toy example, we have 3 internal nodes and thus six constraints. The
constraints associated with the root, the leftmost and rightmost internal
nodes are respectively:

$$\boldsymbol{l}_1 + \boldsymbol{l}_2 = \boldsymbol{p}_0 = 1 - (\boldsymbol{l}_3 + \boldsymbol{l}_4)$$
$$\boldsymbol{l}_1 \le \boldsymbol{p}_1 \le 1 - \boldsymbol{l}_2$$
$$\boldsymbol{l}_3 \le \boldsymbol{p}_2 \le 1 - \boldsymbol{l}_4$$

Finally, property (iii) automatically holds given the previously defined
constraints. To see this, one can walk down the prediction path defined by the
$\boldsymbol{p}$ variables and notice that at each level, the leaf values of
one of the subtrees rooted at the current node must be all zero. For instance,
if $\boldsymbol{p}_{node} = 1$, then we have

$$\boldsymbol{l}^F_1 + \boldsymbol{l}^F_2 + \dots + \boldsymbol{l}^F_j \le 0 \ \Rightarrow\ \boldsymbol{l}^F_1 = \boldsymbol{l}^F_2 = \dots = \boldsymbol{l}^F_j = 0$$

At the last internal node, exactly two leaf variables remain unconstrained,
and one of them is pushed to zero. By the exclusion constraint (2), the
remaining leaf variable must be set to 1.


Model mislabel. Without loss of generality, consider an original instance $x$
such that $f(x) < 0$. In order for $x'$ to be an evading instance, we must
have $f(x') \ge 0$. Encoding the model output $f(x')$ is straightforward given
the leaf variables $\boldsymbol{l}$. The output of each regression tree is
simply the weighted sum of its leaf variables, where the weight of each
variable $\boldsymbol{l}_i$ corresponds to the prediction value $v_i$ of the
associated leaf. Hence, $f(x')$ is the sum of $|\mathcal{T}|$ weighted sums
over the $\boldsymbol{l}$ variables and the following constraint enforces
$f(x') \ge 0$:

$$\sum_i v_i \boldsymbol{l}_i \ge 0$$

For our running example, the mislabeling constraint is:

$$-2\boldsymbol{l}_1 + \boldsymbol{l}_2 + \boldsymbol{l}_3 + 2\boldsymbol{l}_4 \ge 0$$
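
To see how the three constraint families interact on the toy tree, the
following Python sketch enumerates all assignments by brute force. It is an
illustration only, not the MILP solver used in the paper, and it assumes that
the leaf variables can be restricted to {0, 1} for the enumeration since they
are implied binary in any solution.

```python
from itertools import product

LEAF_VALUES = (-2, 1, 1, 2)  # predictions of leaves l1..l4 in the toy tree

def satisfies_tree_constraints(p, l):
    """Predicates consistency and leaves consistency for the toy tree."""
    p0, p1, p2 = p
    l1, l2, l3, l4 = l
    return (
        p1 <= p0                               # predicates consistency
        and l1 + l2 + l3 + l4 == 1             # exclusion constraint (2)
        and l1 + l2 == p0 == 1 - (l3 + l4)     # root node
        and l1 <= p1 <= 1 - l2                 # leftmost internal node
        and l3 <= p2 <= 1 - l4                 # rightmost internal node
    )

def toy_margin(l):
    """Model output f(x') as the weighted sum of the leaf variables."""
    return sum(v * li for v, li in zip(LEAF_VALUES, l))

feasible = [
    (p, l)
    for p in product((0, 1), repeat=3)
    for l in product((0, 1), repeat=4)
    if satisfies_tree_constraints(p, l)
]
# The model mislabel constraint keeps only assignments with f(x') >= 0.
evading = [(p, l) for p, l in feasible if toy_margin(l) >= 0]
print(len(feasible), len(evading))
```

Every feasible assignment activates exactly one leaf, and the mislabel
constraint discards those that activate the leaf with prediction -2.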

Objective. Finally, we need to translate the objective $d(x, x')$ of problem
(1). We rely on the predicate variables $\boldsymbol{p}$ in doing so. For any
distance $L_p$ with $p \in \mathbb{N}$, there exist weights $w_i$ and a
constant $C$ such that the MILP objective can be written as:

$$\sum_i w_i \boldsymbol{p}_i + C$$

We now describe the construction of $(w_i)_i$ and $C$. Recall that for each
feature dimension $1 \le k \le n$, we have a collection of predicate variables
$(\boldsymbol{p}_i)_{i=1..K}$ associated with predicates
$x'_k < \tau_1, \dots, x'_k < \tau_K$ where the thresholds are sorted
$\tau_1 < \dots < \tau_K$. Thus, the $\boldsymbol{p}$ variables effectively
encode the interval to which $x'_k$ belongs, and any feature value within the
interval will lead to the same prediction $f(x')$. There are exactly $K + 1$
distinct possible valuations for the binary variables
$\boldsymbol{p}_1 \le \boldsymbol{p}_2 \le \dots \le \boldsymbol{p}_K$. The
value domain mapping
$\phi : \boldsymbol{p} \mapsto (\mathbb{R} \cup \{-\infty, \infty\})^2$ is:

$$x'_k \in \phi(\boldsymbol{p}) = [\tau_i, \tau_{i+1})$$
$$i = \max\{k \mid \boldsymbol{p}_k = 0,\ 0 \le k < K + 1\}$$

where by convention $\boldsymbol{p}_0 = 0$, $\boldsymbol{p}_{K+1} = 1$ and
$\tau_0 = -\infty$, $\tau_{K+1} = \infty$. Setting aside the $L_\infty$ case
for now, consider $p \in \mathbb{N}$ the norm we are interested in for $d$.
Instead of directly minimizing $\|x - x'\|_p$, our formulation equivalently
minimizes $\|x - x'\|_p^p$. By minimizing the latter, we are able to consider
the contributions of each feature dimension independently:

$$\|x - x'\|_p^p = \sum_{k=1}^{n} |x_k - x'_k|^p$$

We take $0^0 = 0$ by convention. At the optimal solution, $|x_k - x'_k|^p$ can
only take $K + 1$ distinct values. Indeed, if $x'_k$ and $x_k$ belong to the
same interval, then $x'_k = x_k$ minimizes the distance along feature $k$, and
this distance is zero. If $x'_k$ and $x_k$ do not belong to the same interval,
then setting $x'_k$ at the border of $\phi(\boldsymbol{p})$ that is closest to
$x_k$ minimizes the distance along $k$. If
$\phi(\boldsymbol{p}) = [\tau_i, \tau_{i+1})$, this distance is simply equal
to $\min\{|x_k - \tau_i|^p, |x_k - \tau_{i+1}|^p\}$. Note that because of the
right-open interval, the minimum distance is actually an infimum. In our
implementation, we simply use a small guard value $\epsilon$ of the same order
of magnitude as the numerical tolerance of the MILP solver.

Hence, we can express the minimization objective of problem (1) as a weighted
sum of $\boldsymbol{p}$ variables without loss of generality. Let
$0 \le j < K + 1$ be the index such that $x_k \in [\tau_j, \tau_{j+1})$. Let
$(w_i)_{i=0..K+1}$ be such that for any valid valuation of $\boldsymbol{p}$ we
have $\sum_{i=0}^{K+1} w_i \boldsymbol{p}_i = |x_k - x'_k|^p$.

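By the discussion above and exhaustively enumerating the $K + 1$ valuations of
$\boldsymbol{p}$, $w$ is the solution to the following $K + 1$ equations:

$$\begin{aligned}
w_{K+1} &= |x_k - \tau_K|^p \\
w_K + w_{K+1} &= |x_k - \tau_{K-1}|^p \\
&\ \ \vdots \\
w_{j+2} + \dots + w_{K+1} &= |x_k - \tau_{j+1}|^p \\
w_{j+1} + \dots + w_{K+1} &= 0 \\
w_j + \dots + w_{K+1} &= |x_k - (\tau_j - \epsilon)|^p \\
&\ \ \vdots \\
w_2 + \dots + w_{K+1} &= |x_k - (\tau_2 - \epsilon)|^p \\
w_1 + \dots + w_{K+1} &= |x_k - (\tau_1 - \epsilon)|^p
\end{aligned}$$

Since $\boldsymbol{p}_0 = 0$ by convention, $w_0$ never contributes to the sum
and can be set to zero, while $w_{K+1}$, which multiplies
$\boldsymbol{p}_{K+1} = 1$, acts as a constant term. Note that this system of
linear equations is already in triangular form and obtaining the $w$ values is
immediate. To obtain the full MILP objective, we repeat this process for every
feature $1 \le k \le n$ and take the sum of all weighted sums of subsets of
$\boldsymbol{p}$.

Finally, for the $L_\infty$ case, we use 1 continuous variable
$\boldsymbol{b}$. We introduce $n$ additional constraints to the formulation,
one for each feature dimension $k$. As per the previous discussion, we can
generate the weights $w$ such that
$\sum_i w_i \boldsymbol{p}_i = |x_k - x'_k|$ (this is the $p = 1$ case). The
additional constraint on dimension $k$ is then:

$$\sum_{i=0}^{K+1} w_i \boldsymbol{p}_i \le \boldsymbol{b}$$

and the MILP objective is simply the variable $\boldsymbol{b}$ itself.

For our toy example, consider $(x_0 = 0, x_1 = 3)$. In the case of the $L_0$
distance, we have the following objective:

$$1 - \boldsymbol{p}_1 + \boldsymbol{p}_2$$

For the (squared) $L_2$ distance instead, the objective is essentially:

$$4 - 3\boldsymbol{p}_0 - \boldsymbol{p}_1 + 4\boldsymbol{p}_2$$

For the $L_\infty$ case, our objective reduces to the variable
$\boldsymbol{b}$ and we introduce $n$ additional bounding constraints of the
form $\dots \le \boldsymbol{b}$ where the left hand side measures
$|x_k - x'_k|$ using the same technique as the $p = 1$ case.

Hence, the full MILP reduction of the optimal $L_0$-evasion for our toy
instance is:

$$\begin{aligned}
\min_{\boldsymbol{p}, \boldsymbol{l}} \quad & 1 - \boldsymbol{p}_1 + \boldsymbol{p}_2 \\
\text{s.t.} \quad & \boldsymbol{p}_0, \boldsymbol{p}_1, \boldsymbol{p}_2 \in \{0, 1\};\ 0 \le \boldsymbol{l}_1, \boldsymbol{l}_2, \boldsymbol{l}_3, \boldsymbol{l}_4 \le 1 \\
& \boldsymbol{p}_1 \le \boldsymbol{p}_0 && \text{predicates consistency} \\
& \boldsymbol{l}_1 + \boldsymbol{l}_2 + \boldsymbol{l}_3 + \boldsymbol{l}_4 = 1 && \text{leaves consistency} \\
& \boldsymbol{l}_1 + \boldsymbol{l}_2 = \boldsymbol{p}_0 = 1 - (\boldsymbol{l}_3 + \boldsymbol{l}_4) && \text{leaves consistency} \\
& \boldsymbol{l}_1 \le \boldsymbol{p}_1 \le 1 - \boldsymbol{l}_2 && \text{leaves consistency} \\
& \boldsymbol{l}_3 \le \boldsymbol{p}_2 \le 1 - \boldsymbol{l}_4 && \text{leaves consistency} \\
& -2\boldsymbol{l}_1 + \boldsymbol{l}_2 + \boldsymbol{l}_3 + 2\boldsymbol{l}_4 \ge 0 && \text{model mislabel}
\end{aligned}$$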
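
The weight construction for a single feature can be sketched in Python as
follows. This is an illustration under stated assumptions: the function names
are hypothetical, the guard value defaults to zero, and the toy thresholds
(1 and 2 on feature 0, 1 on feature 1) are not given in the text but inferred
so as to be consistent with the toy objectives quoted above.

```python
from bisect import bisect_right

def feature_weights(x_k, thresholds, p, eps=0.0):
    """Return (constant, [w_1..w_K]) such that, with p_i = [x'_k < tau_i],
    constant + sum_i w_i * p_i equals the cheapest |x_k - x'_k|^p."""
    taus = sorted(thresholds)
    K = len(taus)
    j = bisect_right(taus, x_k)  # x_k lies in [tau_j, tau_{j+1}), 1-based taus

    def cost(i):
        # Cheapest move of x'_k into the interval [tau_i, tau_{i+1}).
        if i == j:
            return 0.0                                # stay in place
        if i > j:
            return abs(x_k - taus[i - 1]) ** p        # reach tau_i from below
        return abs(x_k - (taus[i] - eps)) ** p        # approach tau_{i+1} from above

    costs = [cost(i) for i in range(K + 1)]
    constant = costs[K]                               # w_{K+1}, since p_{K+1} = 1
    weights = [costs[m - 1] - costs[m] for m in range(1, K + 1)]
    return constant, weights

def evaluate(constant, weights, x_new, thresholds):
    """Value of the linear objective term for a concrete x'_k."""
    p_vars = [1 if x_new < tau else 0 for tau in sorted(thresholds)]
    return constant + sum(w * q for w, q in zip(weights, p_vars))

# Toy instance x = (0, 3) with the assumed thresholds.
print(feature_weights(0.0, [1.0, 2.0], p=2))  # feature 0, squared L2
print(feature_weights(3.0, [1.0], p=2))       # feature 1, squared L2
print(feature_weights(0.0, [1.0, 2.0], p=0))  # feature 0, L0
```

The weights come back in order of increasing threshold, so for feature 0 the
first weight belongs to the predicate with threshold 1 and the second to the
predicate with threshold 2.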


Additional Constraints. Reducing problem (1) to a MILP allows expressing
potentially complex inter-feature dependencies created by the feature
extraction step. For instance, consider the common case of $K$ mutually
exclusive binary features $x_1, \dots, x_K$ such that in any well-formed
instance, exactly one feature is non-zero. Letting $\boldsymbol{p}_i$ be the
predicate variable associated with $x_i < 0.5$, mutual exclusivity can be
enforced by:

$$\sum_{i=1}^{K} \boldsymbol{p}_i = K - 1$$

4.4. Approximate Evasion 

While the above reduction of problem (1) to an MILP is linear in the size of
the model $f$, the actual solving time can be very significant for difficult
models. Thus, as a complement to the exact method, we develop an approximate
evasion algorithm to generate good quality evading instances. For this part,
we exclusively focus on minimizing the $L_0$ distance. Our approximate evasion
algorithm is based on the iterative coordinate descent procedure described in
Algorithm 1.

Algorithm 1 Coordinate Descent for Problem (1)

Input: model $f$, initial instance $x$ (assume $f(x) < 0$)
Output: evading instance $x'$ such that $f(x') \ge 0$
$x' \leftarrow x$
while $f(x') < 0$ do
    $x' \leftarrow \arg\max_{\tilde{x} : \|\tilde{x} - x'\|_0 = 1} f(\tilde{x})$
end while


In essence, this algorithm greedily modifies the single best 
feature at each iteration until the sign of f{x') changes. 
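
The greedy loop can be sketched in Python as below, with the inner
maximization done by naive enumeration rather than by symbolic prediction.
The tuple-based tree encoding, the helper names, the `max_iter` guard and the
thresholds of the example tree are all assumptions made for the illustration.

```python
import math

# A tree is a float (leaf value) or a tuple
# (feature, threshold, true_subtree, false_subtree) for "x[feature] < threshold".

def predict_tree(tree, x):
    while isinstance(tree, tuple):
        feature, threshold, true_branch, false_branch = tree
        tree = true_branch if x[feature] < threshold else false_branch
    return tree

def margin(trees, x):
    return sum(predict_tree(t, x) for t in trees)

def thresholds_by_feature(trees):
    """Collect, for each feature, the thresholds used by the model."""
    found, stack = {}, list(trees)
    while stack:
        node = stack.pop()
        if isinstance(node, tuple):
            feature, threshold, true_branch, false_branch = node
            found.setdefault(feature, set()).add(threshold)
            stack += [true_branch, false_branch]
    return found

def candidates(value, taus):
    """One representative value per threshold interval not containing `value`."""
    taus = sorted(taus)
    pattern = [value < t for t in taus]
    reps = [taus[0] - 1.0] + taus
    return [v for v in reps if [v < t for t in taus] != pattern]

def coordinate_descent(trees, x, max_iter=100):
    """Greedily change one feature at a time until the margin is non-negative."""
    taus = thresholds_by_feature(trees)
    x_prime = list(x)
    for _ in range(max_iter):
        if margin(trees, x_prime) >= 0:
            return x_prime
        best, best_score = None, -math.inf
        for k, ts in taus.items():
            for v in candidates(x_prime[k], ts):
                trial = x_prime[:k] + [v] + x_prime[k + 1:]
                score = margin(trees, trial)
                if score > best_score:
                    best, best_score = trial, score
        if best is None:
            return None
        x_prime = best
    return x_prime if margin(trees, x_prime) >= 0 else None

toy = (0, 2.0, (0, 1.0, -2.0, 1.0), (1, 1.0, 1.0, 2.0))
x = [0.0, 3.0]
x_evading = coordinate_descent([toy], x)
print(x_evading, margin([toy], x_evading))
```

On this single-tree example one greedy step suffices: only feature 0 is
changed and the margin becomes positive.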






Evasion and Hardening of Tree Ensemble Classifiers 


We now present an efficient algorithm for solving the inner optimization
subproblem

$$\max_{\tilde{x} : \|\tilde{x} - x\|_0 = 1} f(\tilde{x}) \qquad (3)$$

The time complexity of a careful brute force approach is high. For balanced
regression trees, the prediction time for a given instance is
$O(\sum_{T \in \mathcal{T}} \log |T.\text{nodes}|)$. Further, for each
dimension $1 \le k \le n$, we must compute all possible values of
$f(\tilde{x})$ where $\tilde{x}$ and $x$ only differ along dimension $k$. Note
that because the model predicates effectively discretize the feature space,
$f(\tilde{x})$ takes a finite number of distinct values. This number is no
more than one plus the total number of predicates holding over feature $k$.
Hence, we must compute $f(\tilde{x})$ for a total of
$\sum_{T \in \mathcal{T}} |T.\text{nodes}|$ candidates $\tilde{x}$ and the
total running time is
$O(\sum_{T \in \mathcal{T}} |T.\text{nodes}| \times \sum_{T \in \mathcal{T}} \log |T.\text{nodes}|)$.
If we denote by $|f|$ the size of the model which is proportional to the total
number of predicates, the running time is
$O(|f| \, |\mathcal{T}| \log \frac{|f|}{|\mathcal{T}|})$. Tree ensembles often
have thousands of trees, making the $|f| \, |\mathcal{T}|$ dependency
prohibitively expensive.

We can efficiently solve problem (3) by a dynamic programming approach. The
main idea is to visit each internal node no more than once by computing what
value of $\tilde{x}$ can land us at each node. We call this approach symbolic
prediction in reference to symbolic program execution (King, 1976), because we
essentially move a symbolic instance $\tilde{x}$ down the regression tree and
keep track of the constraints imposed on $\tilde{x}$ by all encountered
predicates. Because we are only interested in $\tilde{x}$ instances that are
at most one feature away from $x$, we can stop the tree exploration early if
the current constraints imply that at least two dimensions need to be modified
or, more trivially, if there is no instance $\tilde{x}$ that can
simultaneously satisfy all the constraints. When reaching a leaf, we report
the leaf prediction value $f(\tilde{x})$ along with the pair of perturbed
dimension number $k$ and value interval for $\tilde{x}_k$ which would reach
the given leaf.

To simplify the presentation of the algorithm, we introduce a SymbolicInstance
data structure which keeps track of the constraints on $\tilde{x}$. This
structure is initialized by $x$ and has four methods.

• For a predicate $p$, .isFeasible($p$) returns true if and only if there
  exists an instance $\tilde{x}$ such that $\|\tilde{x} - x\|_0 \le 1$ and all
  constraints including $p$ hold.

• .update($p$) updates the set of constraints on $\tilde{x}$ by adding
  predicate $p$.

• .isChanged() returns true if and only if the current set of constraints
  implies $\tilde{x} \neq x$.

• .getPerturbation() returns the index $k$ such that $\tilde{x}_k \neq x_k$
  and the admissible interval of values for $\tilde{x}_k$.

It is possible to implement SymbolicInstance such that each method executes in
constant time.

Algorithm 2 presents the symbolic prediction algorithm recursively for a given
tree. It updates a list of elements by appending tuples to it. The first
element of a tuple is the feature index $k$ where $\tilde{x}_k \neq x_k$, the
second element is the allowed right-open interval for $\tilde{x}_k$, and the
last element is the prediction score $f(\tilde{x})$.


Algorithm 2 Recursive definition of the symbolic prediction algorithm. For the
first call, $n$ is the tree root, $s$ is a fresh SymbolicInstance object
initialized on $x$ with no additional constraints and $l$ is an empty list.

Input: node $n$ (either internal or leaf)
Input: $s$ of type SymbolicInstance
Input/Output: list of tuples $l$ (see description)
if $n$ is a leaf then
    if $s$.isChanged() then
        $l \leftarrow l \cup \{(s.\text{getPerturbation}(), n.\text{prediction})\}$
    end if
else
    if $s$.isFeasible($n$.predicate) then
        $s_T \leftarrow$ copy($s$)
        $s_T$.update($n$.predicate)
        SymbolicPrediction($n$.true, $s_T$, $l$)
    end if
    if $s$.isFeasible($\neg n$.predicate) then
        $s$.update($\neg n$.predicate)
        SymbolicPrediction($n$.false, $s$, $l$)
    end if
end if


This algorithm visits each node at most once and performs at most one copy of
the SymbolicInstance $s$ per visit. The copy operation is proportional to the
number of constraints in $s$. For a balanced tree $T$, the copy cost is
$O(\log |T.\text{nodes}|)$, so that the total running time is
$O(|T.\text{nodes}| \log |T.\text{nodes}|)$.
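
A possible Python rendering of symbolic prediction for a single tree is
sketched below. It is not the authors' implementation: the tuple-based tree
encoding, the representation of a predicate as a (feature, threshold, truth)
triple, the per-feature interval dictionary (which makes a copy proportional
to the number of constrained features), and the example thresholds are all
assumptions made for the illustration.

```python
import math

class SymbolicInstance:
    """Tracks interval constraints on x~ while allowing at most one changed feature."""

    def __init__(self, x):
        self.x = x
        self.bounds = {}     # feature -> (lo, hi), meaning lo <= x~[feature] < hi
        self.changed = None  # index of the single perturbed feature, if any

    def copy(self):
        other = SymbolicInstance(self.x)
        other.bounds = dict(self.bounds)
        other.changed = self.changed
        return other

    def _narrow(self, feature, threshold, truth):
        lo, hi = self.bounds.get(feature, (-math.inf, math.inf))
        if truth:                       # x~[feature] < threshold
            hi = min(hi, threshold)
        else:                           # x~[feature] >= threshold
            lo = max(lo, threshold)
        return lo, hi

    def is_feasible(self, feature, threshold, truth):
        lo, hi = self._narrow(feature, threshold, truth)
        if lo >= hi:
            return False                # no value can satisfy all constraints
        if lo <= self.x[feature] < hi:
            return True                 # the original value still fits
        return self.changed in (None, feature)  # may perturb only one feature

    def update(self, feature, threshold, truth):
        lo, hi = self._narrow(feature, threshold, truth)
        self.bounds[feature] = (lo, hi)
        if not (lo <= self.x[feature] < hi):
            self.changed = feature

    def is_changed(self):
        return self.changed is not None

    def get_perturbation(self):
        return self.changed, self.bounds[self.changed]

def symbolic_prediction(node, s, out):
    """Append (feature, interval, leaf value) for leaves reachable by one change."""
    if not isinstance(node, tuple):     # leaf: node is the prediction value
        if s.is_changed():
            k, interval = s.get_perturbation()
            out.append((k, interval, node))
        return
    feature, threshold, true_branch, false_branch = node
    if s.is_feasible(feature, threshold, True):
        s_true = s.copy()
        s_true.update(feature, threshold, True)
        symbolic_prediction(true_branch, s_true, out)
    if s.is_feasible(feature, threshold, False):
        s.update(feature, threshold, False)
        symbolic_prediction(false_branch, s, out)

# Tree encoded as (feature, threshold, true_subtree, false_subtree) or a float leaf.
toy = (0, 2.0, (0, 1.0, -2.0, 1.0), (1, 1.0, 1.0, 2.0))
tuples = []
symbolic_prediction(toy, SymbolicInstance([0.0, 3.0]), tuples)
print(tuples)
```

For the instance (0, 3), the leaf with value -2 is skipped because it is
reached without any change, and the leaf that would require changing both
features is pruned by the feasibility test.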

For each tree of the model, once the list of dimension-interval-prediction
tuples is obtained, we subtract the leaf prediction value for $x$ from
all predictions in order to obtain a score variation between $x$ and
$\tilde{x}$ instead of the score for $\tilde{x}$. With the help of an
additional data structure, we can use the dimension-interval-variation
tuples across all trees to find the dimension $k$ and interval for
$\tilde{x}_k$ which corresponds to the highest variation
$f(\tilde{x}) - f(x)$. This final search can be done in $O(L \log L)$,
where $L$ is the total number of tuples, and is no larger than
$\sum_{T} |T.\mathrm{leaves}|$ by construction. To summarize, the time
complexity of our method for solving this problem is
$O(|f| \log |f|)$, an exponential improvement over the brute force
method.
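The final search over the dimension-interval-variation tuples could be organized as a sort followed by a sweep, as sketched below. The text does not specify the additional data structure, so the per-feature event lists and the function name `best_single_feature_change` are assumptions made for this example. The sketch relies on the intervals being right-open, and on the fact that a tree contributes no variation for values not covered by any of its tuples.

```python
import math
from collections import defaultdict


def best_single_feature_change(tuples):
    """tuples: iterable of (k, (lo, hi), variation) gathered over all trees.

    Returns (total variation, k, (lo, hi)) for the single-feature change with
    the highest summed variation, or (0.0, None, None) if nothing improves.
    """
    events = defaultdict(list)
    for k, (lo, hi), delta in tuples:
        events[k].append((lo, delta))    # interval opens: variation applies
        events[k].append((hi, -delta))   # interval closes (right-open bound)

    best = (0.0, None, None)
    for k, evs in events.items():
        evs.sort(key=lambda e: e[0])     # O(L log L) over all features
        total, i = 0.0, 0
        while i < len(evs):
            pos = evs[i][0]
            # Apply every event located at `pos` before scoring the segment.
            while i < len(evs) and evs[i][0] == pos:
                total += evs[i][1]
                i += 1
            # `total` is now the summed variation on [pos, next position).
            if i < len(evs) and total > best[0]:
                best = (total, k, (pos, evs[i][0]))
    return best
```

Any value of the selected feature inside the returned interval yields the same summed variation, so the caller is free to pick the value closest to the original one.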

5. Results 

We turn to the experimental evaluation of the robustness of tree
ensembles. We start by describing the evaluation dataset and our
choice of models for benchmarking purposes, before moving to a
quantitative comparison of the robustness of boosted trees and random
forest models against a variety of other learning algorithms. We
finally show that the brittleness of boosted trees can be effectively
addressed by including fresh evading instances in the training set
during boosting.


Model       Parameters                              Test Error
Lin. $L_1$  $C = 0.5$                               1.5%
Lin. $L_2$  $C = 0.2$                               1.5%
BDT         1,000 trees, depth 4, $\eta = 0.02$     0.25%
RF          80 trees, max. depth 22                 0.20%
CPM         $k = 30$, $C = 0.01$                    0.20%
NN          60-60-30 sigmoidal (tanh) units         0.25%
RBF-SVM     $\gamma = 0.04$, $C = 1$                0.25%
BDT-R       1,000 trees, depth 6, $\eta = 0.01$     0.20%

Table 1. The considered models. BDT-R is the hardened boosted
trees model introduced in section 5.4.


5.1. Dataset and Method 


We choose digit recognition over the MNIST (LeCun et al.) dataset as
our benchmark classification task for three reasons. First, the MNIST
dataset is well studied and exempt from labeling errors. Second, there
is a one-to-one mapping between pixels and features, so that features
can vary independently from each other. Third, we can pictorially
represent evading instances, which helps in understanding the models’
robustness or lack thereof. Our running binary classification task is
to distinguish between handwritten digits “2” and “6”. Our training
and testing sets respectively include 11,876 and 1,990 images. Each
image has 28 x 28 gray scale pixels, and our feature space is
$\mathcal{X}$ = [0,

As our main goal is not to compare model accuracies, but rather to
obtain the best possible model for each model class, we tune the
hyper-parameters so as to minimize the error on the testing set
directly. In addition to the training and testing sets, we create an
evaluation dataset of a hundred instances from the testing set such
that every instance is correctly classified by all of the benchmarked
models. These correctly classified instances serve as $x$, the
starting point instances in the evasion problem.
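The construction of the evaluation dataset can be illustrated with a few lines of Python. This is only a sketch: the text does not say how the hundred instances are picked among the eligible ones, so taking the first matches in order is an assumption, and `build_evaluation_set` together with the convention that a model is a callable returning a label are hypothetical.

```python
def build_evaluation_set(instances, labels, models, size=100):
    """Keep the first `size` test instances that every model classifies correctly."""
    selected = []
    for x, y in zip(instances, labels):
        # An instance qualifies only if all benchmarked models agree with its label.
        if all(model(x) == y for model in models):
            selected.append((x, y))
            if len(selected) == size:
                break
    return selected
```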


5.2. Considered Models 


Table 1 summarizes the 7 benchmarked models with their salient
hyper-parameters and error rates on the testing set. For our tree
ensembles, BDT is a (gradient) boosted decision trees model in the
modern XGBoost implementation (Chen & ...) and RF is a random forest
trained using scikit-learn (Buitinck et al., 2013). We also include
the following models for comparison purposes. Lin. $L_1$ and
Lin. $L_2$ are respectively $L_1$- and $L_2$-regularized logistic
regressions using the LibLinear (Fan et al., 2008) implementation.
RBF-SVM is a regular Gaussian kernel SVM trained using LibSVM
(Chang & Lin, 2011). NN is a 3 hidden layer neural network with a top
logistic regression layer implemented using Theano (Bergstra et al.,
2010) (no pre-training, no drop-out). Finally, our last benchmark
model is the equivalent of a shallow neural network made of two
max-out units (one unit for each class), each made of thirty linear
classifiers. This model corresponds to the difference of two Convex
Polytope Machines (Kantchelian et al., 2014) (one for each class) and
we use the authors’ implementation (CPM). Two factors motivate the
choice of CPM. First, previous work has theoretically considered the
evasion robustness of such ensembles of linear classifiers and proved
the problem to be NP-hard (Stevens & Lowd, 2013). Second, unlike
RBF-SVM and NN, this model can be readily reduced to a Mixed Integer
Program, enabling optimal evasions thanks to a MIP solver. As the
reduction is considerably simpler than the one presented for tree
ensembles above, we omit it here. Except for the two linear
classifiers, all models have a comparable, very low error rate on the
benchmark task.


Figure 4. First 4 rows: examples of optimal or best effort evading
“6” instances. Every picture is misclassified as “2” by its column
model. Last row: feature importance computed as frequency of pixel
modification in the $L_0$-evasions (darker means feature is more
often picked).

5.3. Robustness 

Figure 3. Optimal (white boxes) or best-effort (gray boxes) evasion
bounds for different metrics on the evaluation dataset. The smallest
bounds, 25-50% and 50-75% quartiles and largest bounds are shown. The
red line is the median score. Larger scores mean more deformations
are necessary to change the model prediction.

For each learned model, and for all of the 100 correctly classified
evaluation instances, we compute the optimal (or best effort)
solution to the evasion problem under all of the deformation metrics.
We use the Gurobi (Gurobi Optimization, 2015) solver to compute the
optimal evasions for all distances and all models but NN and RBF-SVM.
We use a classic projected gradient descent method for solving the
$L_1$, $L_2$ and $L_\infty$ evasions of NN and RBF-SVM, and address
the $L_0$-evasion case by an iterative coordinate descent algorithm
and a brute force grid search at each iteration. Figure 3 summarizes
the obtained adversarial bounds as one boxplot for each combination
of model and distance. Although the tree ensembles BDT and RF have
very competitive accuracies, they systematically rank at the bottom
for robustness across all metrics. Remarkably, negligible $L_1$ or
$L_2$ perturbations suffice to evade those models. RBF-SVM is
apparently the hardest model to evade, agreeing with results from
(Goodfellow et al., 2014). NN and CPM exhibit very similar
performance despite having quite different architectures. Finally,
the $L_1$-regularized linear model exhibits significantly more
brittleness than its $L_2$ counterpart. This phenomenon is explained
by large weights concentrating in specific dimensions as a result of
sparsity. Thus, small modifications in the heavily weighted model
dimensions can result in large classifier output variations.
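The effect of weight concentration in the linear case can be made explicit with the standard dual-norm identity. The following is a worked illustration rather than a result from this paper. It assumes a linear score $f(x) = w^\top x + b$ and ignores the bounds of the feature space, so the actual optimal evasions can be somewhat larger.

```latex
\begin{align*}
  \min_{\delta \,:\, f(x + \delta) = 0} \|\delta\|_p
    &= \frac{|f(x)|}{\|w\|_q},
    \qquad \frac{1}{p} + \frac{1}{q} = 1, \\
  p = 1 &: \quad \frac{|f(x)|}{\|w\|_\infty}, \\
  p = 2 &: \quad \frac{|f(x)|}{\|w\|_2}, \\
  p = \infty &: \quad \frac{|f(x)|}{\|w\|_1}.
\end{align*}
```

For a fixed score $|f(x)|$, the $L_1$ evasion distance depends only on the largest weight magnitude, so a sparse model that places large weights on a few dimensions can be evaded by a small change to one of those dimensions.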


5.4. Hardening by Adversarial Boosting

We empirically demonstrate how to significantly improve the
robustness of the BDT model by adding evading instances to the
training set during the boosting process. At each boosting round, we
use our fast symbolic prediction-based algorithm to create budgeted
“adversarial” instances with respect to the current model and for all
the 11,876 original training instances. For a given training instance
$x$ with label $y$ and a modification budget $B > 1$, a budgeted
“adversarial” training instance $x^*$ is such that
$\|x - x^*\|_0 \leq B$ and the margin $y f(x^*)$ is as small as
possible. Here, we use $B = 28$, the size of the picture diagonal, as
our budget. The reason is that modifying 28 pixels out of 784 is not
enough to perceptually morph a handwritten “2” into “6”. The training
dataset for the current round is then formed by appending to the
original training dataset these evading instances along with their
correct labels, thus increasing the size of the training set by a
factor 2. Finally, gradient boosting produces the next regression
tree which by definition minimizes the error of the augmented
ensemble model on the adversarially-enriched training set. After
1,000 adversarial boosting rounds, our model has encountered more
than 11 million adversarial instances, without ever training on more
than 24,000 instances at a time.

We found that we needed to increase the maximum tree depth from 4 to
6 in order to obtain an acceptable error rate. After 1,000
iterations, the resulting model BDT-R has a slightly higher testing
accuracy than BDT (see Table 1). Unlike BDT, BDT-R is extremely
challenging to optimally evade using the MILP solver: the
branch-and-bound search continues to expand nodes after 1 day on a
6 core Xeon 3.2GHz machine. To obtain the tightest possible evasion
bound, we warm-start the solver with the solution found by the fast
evasion technique and report the best solution found by the solver
after an hour. Figure 3 shows that BDT-R is more robust than our
previous champion RBF-SVM with respect to $L_0$ deformations.
Unfortunately, we found significantly lower scores on all $L_1$,
$L_2$ and $L_\infty$ distances compared to the original BDT model:
hardening against $L_0$-evasions made the model more sensitive to all
other types of evasions.
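The adversarial boosting procedure can be summarized by the following Python sketch. It only shows the structure of the training loop: `fit_next_tree` (one gradient boosting step on a given training set) and `find_evasion` (the budgeted fast evasion search) are hypothetical callables supplied by the caller, not functions of any particular library.

```python
def adversarial_boosting(X, y, rounds, budget, fit_next_tree, find_evasion):
    """Grow an ensemble while retraining on fresh evading instances each round."""
    ensemble = []
    for _ in range(rounds):
        # One budgeted adversarial instance per original training instance,
        # computed against the current (partial) ensemble.
        X_adv = [find_evasion(ensemble, x, label, budget)
                 for x, label in zip(X, y)]
        # Evading instances keep their correct labels, so the round's
        # training set is twice the size of the original one.
        X_round = list(X) + X_adv
        y_round = list(y) + list(y)
        ensemble.append(fit_next_tree(ensemble, X_round, y_round))
    return ensemble
```

Only the adversarial instances of the current round are kept, which is why the training set never grows beyond twice its original size even though new evading instances are generated at every round.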


6. Conclusion 

We have presented two novel algorithms, one exact and one
approximate, for systematically computing evasions of tree ensembles
such as boosted trees and random forests. On a classic digit
recognition task, both gradient boosted trees and random forests are
extremely susceptible to evasions. We also introduce adversarial
boosting and show that it trains models that are hard to evade,
without sacrificing accuracy. One future work direction would be to
use these algorithms to generate “small” evading instances for
practical security systems. Another direction would be to better
understand the properties of adversarial boosting. In particular, it
is not known whether this hardening approach would succeed on all
possible datasets.